! pip install Unidecode
! pip install arabic-reshaper
! pip install python-bidi
! pip install -U kaleido
import requests
import json
import time
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from unidecode import unidecode
from bidi.algorithm import get_display
from arabic_reshaper import reshape
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
plt.style.use("ggplot")
The code for scraping data from Divar is available in divar_scraper.ipynb.
First, I investigated the possibility of crawling the website and extracting the data. As seen in the picture below, after scrolling, the Divar client sends a request to the API.
Based on this API, I found that I could access the data of the search pages by sending a POST request to https://api.divar.ir/v8/web-search/1/apartment-sell.
The payload looks like this:
{
    "json_schema": {
        "category": {"value": "apartment-sell"},
        "cities": ["1"]
    },
    "last-post-date": 1671722824386533
}

As I selected the apartment-sell category to analyze, the category is "apartment-sell".
Moreover, the city was Tehran, which has the code 1. The codes for all cities are available at https://api.divar.ir/v8/places/cities. I chose the cities below for gathering the data.

cities = {
    1: "Tehran",
    2: "Karaj",
    3: "Mashhad",
    4: "Isfahan",
    5: "Tabriz",
    6: "Shiraz",
    7: "Ahvaz",
    8: "Qom",
}

Finally, "last-post-date" indicates the epoch time of the last post shown to the user on the page. It is used when the user scrolls to load posts older than the last one they saw. I used this mechanism in my scraper to fetch successive pages from the API.
For completeness, the code of divar-scraper is included below as well. (The API cannot be accessed from Google Colab; please run this part on a local machine.)
import requests
import json
import time
import pandas as pd
from tqdm.auto import tqdm

place2_map = {
    1: "Tehran",
    2: "Karaj",
    3: "Mashhad",
    4: "Isfahan",
    5: "Tabriz",
    6: "Shiraz",
    7: "Ahvaz",
    8: "Qom",
}

def request_api(last_post_date=0):
    url = 'https://api.divar.ir/v8/web-search/1/apartment-sell'
    my_obj = {
        "json_schema": {
            "category": {"value": "apartment-sell"},
            "cities": [str(i) for i in place2_map.keys()],
        },
        "last-post-date": last_post_date,
    }
    response = requests.post(url, json=my_obj)
    response_json = json.loads(response.text)
    return response_json

def flatten_dict(dd, separator='.', prefix=''):
    return {
        prefix + separator + k if prefix else k: v
        for kk, vv in dd.items()
        for k, v in flatten_dict(vv, separator, kk).items()
    } if isinstance(dd, dict) else {prefix: dd}

def retrieve_raw_dataset(iterations=60) -> pd.DataFrame:
    raw_dataset = []
    last_post_date = 0
    for iteration in tqdm(range(iterations), desc="Iterations"):
        got_response = False
        while not got_response:
            try:
                response_json = request_api(last_post_date)
                for post_row in response_json["web_widgets"]['post_list']:
                    assert post_row['widget_type'] == "POST_ROW"
                    data = flatten_dict(post_row)
                    data["raw_response"] = response_json
                    raw_dataset.append(data)
                last_post_date = response_json["last_post_date"]  # Next scroll
                time.sleep(1)
                got_response = True
            except Exception as e:
                print("API Call Failed!")
    return pd.DataFrame(raw_dataset)

### UNCOMMENT BELOW FOR TESTING
### THE RETRIEVED DATA IS AVAILABLE AT https://github.com/mohsenfayyaz/data-analytics-course-ut/blob/main/Project_EDA/divar_apartment_sell_24000.csv
# df = retrieve_raw_dataset(iterations=2)
# df.to_pickle("divar_data.pkl")
# df
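To illustrate what flatten_dict does to a post row, here is a small standalone sketch (the function re-defined so the snippet runs on its own, applied to a made-up nested post):

```python
def flatten_dict(dd, separator='.', prefix=''):
    # Recursively joins nested keys with the separator: {"a": {"b": 1}} -> {"a.b": 1}
    return {
        prefix + separator + k if prefix else k: v
        for kk, vv in dd.items()
        for k, v in flatten_dict(vv, separator, kk).items()
    } if isinstance(dd, dict) else {prefix: dd}

# A hypothetical nested post row, much smaller than the real API response
post = {"data": {"title": "apt", "action": {"type": "VIEW_POST"}}, "widget_type": "POST_ROW"}
print(flatten_dict(post))
# {'data.title': 'apt', 'data.action.type': 'VIEW_POST', 'widget_type': 'POST_ROW'}
```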
Each API call returns 24 posts. In the end, I called the API 1000 times (about 30 minutes) and acquired 24,000 unique posts, available at: https://github.com/mohsenfayyaz/data-analytics-course-ut/blob/main/Project_EDA/divar_apartment_sell_24000.csv
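Since consecutive scrolls can overlap, uniqueness is worth enforcing explicitly. A minimal sketch on toy data (tokens made up), deduplicating by the post token:

```python
import pandas as pd

# Toy frame with one token scraped twice across overlapping scrolls
raw = pd.DataFrame({
    "token": ["wYaKSWxx", "wYVu85pC", "wYaKSWxx"],
    "title": ["post a", "post b", "post a"],
})
deduped = raw.drop_duplicates(subset="token").reset_index(drop=True)
print(len(deduped))  # 2
```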
# Download the Scraped Data
! wget https://github.com/mohsenfayyaz/data-analytics-course-ut/raw/main/Project_EDA/divar_apartment_sell_24000.csv
df = pd.read_csv("./divar_apartment_sell_24000.csv", low_memory=False, index_col=0)
df.head(2)
| | widget_type | data.@type | data.image_url | data.image_count | data.title | data.has_chat | data.top_description_text | data.middle_description_text | data.bottom_description_text | data.red_text | ... | data.image_top_left_tag.icon.icon_color | data.image_top_left_tag.text | data.image_top_left_tag.is_icon_left | data.image_overlay_tag.text | data.image_overlay_tag.image_url | data.image_overlay_tag.image_color | data.image_overlay_tag.icon.image_url_dark | data.image_overlay_tag.icon.image_url_light | data.image_overlay_tag.icon.icon_name | data.image_overlay_tag.icon.icon_color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | POST_ROW | type.googleapis.com/widgets.PostRowData | NaN | 0 | کلاهدوز صفر | True | NaN | ۲,۸۵۰,۰۰۰,۰۰۰ تومان | آژانس املاک در مشهد | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | POST_ROW | type.googleapis.com/widgets.PostRowData | [{'src': 'https://s100.divarcdn.com/static/thu... | 6 | فروش ۱۰۰ متر آپارتمان ۲خواب // قیطریه جنوبی | False | NaN | ۱۳,۲۰۰,۰۰۰,۰۰۰ تومان | آژانس املاک در تهران | NaN | ... | WHITE_PRIMARY | ۶ | True | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 57 columns
What the API returns is targeted at the front end for visualization, and some of it adds no information. In this part we find the unique values of each column and drop the columns that carry no information.
"data.title" and "data.action.payload.web_info.title" are identical for all rows, so we drop "data.action.payload.web_info.title". We also drop some UI-only columns that are not needed.
for col_name in df.columns:
    uniques = df[col_name].unique()
    if len(uniques) == 1:
        print(f"Dropping column [{col_name}] which has only 1 value = {uniques}")
        df = df.drop(columns=col_name)

assert (df["data.title"] == df["data.action.payload.web_info.title"]).all()  # Title is duplicated
df = df.drop(
    columns=[
        "data.action.payload.web_info.title",
        "action_log.server_side_info.info.index",
        "data.image_top_left_tag.icon.image_url_light",
        "data.image_top_left_tag.icon.image_url_dark",
        "data.image_overlay_tag.icon.image_url_dark",
        "data.image_overlay_tag.icon.image_url_light",
        "data.image_url",
        "data.image_top_left_tag.icon.icon_color",
        "data.image_overlay_tag.image_url",
        "data.image_overlay_tag.image_color",
        "data.image_overlay_tag.icon.icon_name",
        "data.image_overlay_tag.icon.icon_color",
        "data.image_top_left_tag.text",
        "action_log.server_side_info.info.post_token",
        "data.image_top_left_tag.is_icon_left",
        "data.action.payload.token",
    ]
)
Dropping column [widget_type] which has only 1 value = ['POST_ROW']
Dropping column [data.@type] which has only 1 value = ['type.googleapis.com/widgets.PostRowData']
Dropping column [data.top_description_text] which has only 1 value = [nan]
Dropping column [data.action.type] which has only 1 value = ['VIEW_POST']
Dropping column [data.action.payload.@type] which has only 1 value = ['type.googleapis.com/widgets.ViewPostPayload']
Dropping column [data.action.payload.web_info.category_slug_persian] which has only 1 value = ['آپارتمان']
Dropping column [data.action.payload.vertical_info.@type] which has only 1 value = ['type.googleapis.com/base.VerticalInfo']
Dropping column [data.action.payload.vertical_info.vertical_type] which has only 1 value = [2]
Dropping column [data.action.fallback_link] which has only 1 value = [nan]
Dropping column [data.action.page_pop_link] which has only 1 value = [False]
Dropping column [data.checkable] which has only 1 value = [False]
Dropping column [data.label] which has only 1 value = [nan]
Dropping column [data.label_color] which has only 1 value = [nan]
Dropping column [data.note] which has only 1 value = [nan]
Dropping column [data.tags] which has only 1 value = ['[]']
Dropping column [data.standard_label_color] which has only 1 value = ['UNKNOWN']
Dropping column [data.is_checked] which has only 1 value = [False]
Dropping column [data.has_divider] which has only 1 value = [ True]
Dropping column [data.padded] which has only 1 value = [False]
Dropping column [action_log.server_side_info.info.@type] which has only 1 value = ['type.googleapis.com/action_log.PostItemInfo']
Dropping column [action_log.server_side_info.info.post_type] which has only 1 value = ['SEARCH']
Dropping column [action_log.server_side_info.info.list_type] which has only 1 value = ['PAGE']
Dropping column [action_log.server_side_info.info.source_page] which has only 1 value = ['POST_LIST_PAGE']
Dropping column [action_log.server_side_info.info.extra_data.@type] which has only 1 value = ['type.googleapis.com/action_log.PostItemInfo.SearchExtraData']
Dropping column [action_log.server_side_info.info.extra_data.jli.category.value] which has only 1 value = ['apartment-sell']
Dropping column [action_log.server_side_info.info.extra_data.jli.cities] which has only 1 value = ["['1', '2', '3', '4', '5', '6', '7', '8']"]
Dropping column [action_log.server_side_info.info.extra_data.no_cat_page] which has only 1 value = [False]
Dropping column [action_log.server_side_info.item_type.type] which has only 1 value = ['POST_ITEM']
Dropping column [action_log.enabled] which has only 1 value = [ True]
There are multiple columns curated for the UI, and here we extract the data we need from them. This information includes:
- "data.middle_description_text": includes the price in Persian, such as ۲,۹۰۰,۰۰۰,۰۰۰ تومان (2,900,000,000 Tomans) or توافقی (negotiable), which needs to be converted to an integer.
- "data.bottom_description_text": includes the business type of the poster, such as آژانس املاک (real estate agency) / مشاور املاک (real estate consultant) / شخصی (personal), which needs to be extracted from texts such as مشاور املاک در تهران (real estate consultant in Tehran).
- "data.red_text": includes nan or فوری (urgent), which should be converted to a binary column "promoted".
- "action_log.server_side_info.info.extra_data.last_post_sort_date": includes the approximate epoch time of retrieving the data, which needs to be converted to a datetime type.
- "data.image_top_left_tag.icon.icon_name": includes nan, 'CAMERA', 'VIDEOCAM', or 'VR'. The column needs to be renamed to "media".
- "data.image_overlay_tag.text": includes nan or 'صحت اطلاعات' (verified information), which needs to be renamed to "inspection" and become a binary column.
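The first conversion above hinges on mapping Persian digits to ASCII. The notebook uses unidecode for this; the sketch below uses a stdlib str.maketrans table as an equivalent, dependency-free stand-in:

```python
# Translation table from Persian digits to ASCII digits
# (unidecode performs the same mapping in the notebook)
PERSIAN_DIGITS = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")

def persian_price_to_int(text):
    # Map Persian digits to ASCII, keep only digits, and parse as an integer
    ascii_text = text.translate(PERSIAN_DIGITS)
    digits = ''.join(c for c in ascii_text if c.isdigit())
    return int(digits) if digits else None  # توافقی (negotiable) has no digits

print(persian_price_to_int("۲,۸۵۰,۰۰۰,۰۰۰ تومان"))  # 2850000000
```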
for col_name in df.columns:
    uniques = df[col_name].unique()
    print(f"{col_name.ljust(65)} > {uniques[:4]}...")
df.head(2)
data.image_count > [ 0 6 16 1]...
data.title > ['کلاهدوز صفر' 'فروش ۱۰۰ متر آپارتمان ۲خواب // قیطریه جنوبی' 'آپارتمان ۱۱۰ متری کلید نخورده در خیابان مرتضوی' '116متر ، نوساز ، فول امکانات / شهرک صدف']...
data.has_chat > [ True False]...
data.middle_description_text > ['۲,۸۵۰,۰۰۰,۰۰۰ تومان' '۱۳,۲۰۰,۰۰۰,۰۰۰ تومان' '۵,۱۳۰,۰۰۰,۰۰۰ تومان' '۲,۹۰۰,۰۰۰,۰۰۰ تومان']...
data.bottom_description_text > ['آژانس املاک در مشهد' 'آژانس املاک در تهران' 'مشاور املاک در تهران' 'لحظاتی پیش در کرج']...
data.red_text > [nan 'فوری ']...
data.action.payload.web_info.district_persian > ['کلاهدوز' 'قیطریه' 'هاشمی' 'شهر جدید اندیشه']...
data.action.payload.web_info.city_persian > ['مشهد' 'تهران' 'کرج' 'اصفهان']...
data.token > ['wYaKSWxx' 'wYVu85pC' 'wYki1hsI' 'wYkilp7n']...
action_log.server_side_info.info.extra_data.last_post_sort_date > [1671715067723008 1671715046397571 1671715016893136 1671714969372055]...
data.image_top_left_tag.icon.icon_name > [nan 'CAMERA' 'VIDEOCAM' 'VR']...
data.image_overlay_tag.text > [nan 'صحت اطلاعات']...
| | data.image_count | data.title | data.has_chat | data.middle_description_text | data.bottom_description_text | data.red_text | data.action.payload.web_info.district_persian | data.action.payload.web_info.city_persian | data.token | action_log.server_side_info.info.extra_data.last_post_sort_date | data.image_top_left_tag.icon.icon_name | data.image_overlay_tag.text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | کلاهدوز صفر | True | ۲,۸۵۰,۰۰۰,۰۰۰ تومان | آژانس املاک در مشهد | NaN | کلاهدوز | مشهد | wYaKSWxx | 1671715067723008 | NaN | NaN |
| 1 | 6 | فروش ۱۰۰ متر آپارتمان ۲خواب // قیطریه جنوبی | False | ۱۳,۲۰۰,۰۰۰,۰۰۰ تومان | آژانس املاک در تهران | NaN | قیطریه | تهران | wYVu85pC | 1671715067723008 | CAMERA | NaN |
def persian_to_price(x: str) -> float:
    if x == "توافقی":  # Negotiable price has no numeric value
        return np.nan
    x = unidecode(x)
    x = ''.join(c for c in x if c.isdigit())
    return int(x.strip())

def infer_business_type(x: str) -> str:
    if "آژانس" in x:  # Agency
        return "agency"
    if "مشاور" in x:  # Consultant
        return "consultant"
    return "personal"

df["price_persian"] = df["data.middle_description_text"]
df["price_million"] = df["data.middle_description_text"].map(persian_to_price) / 1_000_000
df["business_type"] = df["data.bottom_description_text"].map(infer_business_type)
df["promoted"] = df["data.red_text"] == 'فوری '
df["last_post_sort_date"] = pd.to_datetime(df["action_log.server_side_info.info.extra_data.last_post_sort_date"], unit='us')
df["media"] = df["data.image_top_left_tag.icon.icon_name"]
df["inspection"] = df["data.image_overlay_tag.text"] == 'صحت اطلاعات'
df = df.drop(columns=[
    "data.red_text",
    "action_log.server_side_info.info.extra_data.last_post_sort_date",
    "data.image_top_left_tag.icon.icon_name",
    "data.image_overlay_tag.text",
    "data.middle_description_text",
    "data.bottom_description_text",
])
for col_name in df.columns:
    df = df.rename(columns={col_name: col_name.split(".")[-1]})
df = df.reindex(sorted(df.columns, reverse=True), axis=1)
df
| | token | title | promoted | price_persian | price_million | media | last_post_sort_date | inspection | image_count | has_chat | district_persian | city_persian | business_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wYaKSWxx | کلاهدوز صفر | False | ۲,۸۵۰,۰۰۰,۰۰۰ تومان | 2850.0 | NaN | 2022-12-22 13:17:47.723008 | False | 0 | True | کلاهدوز | مشهد | agency |
| 1 | wYVu85pC | فروش ۱۰۰ متر آپارتمان ۲خواب // قیطریه جنوبی | False | ۱۳,۲۰۰,۰۰۰,۰۰۰ تومان | 13200.0 | CAMERA | 2022-12-22 13:17:47.723008 | False | 6 | False | قیطریه | تهران | agency |
| 2 | wYki1hsI | آپارتمان ۱۱۰ متری کلید نخورده در خیابان مرتضوی | False | ۵,۱۳۰,۰۰۰,۰۰۰ تومان | 5130.0 | CAMERA | 2022-12-22 13:17:47.723008 | False | 16 | True | هاشمی | تهران | consultant |
| 3 | wYkilp7n | 116متر ، نوساز ، فول امکانات / شهرک صدف | False | ۲,۹۰۰,۰۰۰,۰۰۰ تومان | 2900.0 | CAMERA | 2022-12-22 13:17:47.723008 | False | 1 | False | شهر جدید اندیشه | کرج | personal |
| 4 | wYkiFo1Z | قیطریه (کیکاووس) ۱۵۰ متر فول بازسازی | False | ۱۲,۶۰۰,۰۰۰,۰۰۰ تومان | 12600.0 | NaN | 2022-12-22 13:17:47.723008 | False | 0 | True | قیطریه | تهران | agency |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23995 | wYjqBo7O | 141متر/تاپ لوکیشن/دو پارکینگ/تک بازدید | False | ۴,۰۵۰,۰۰۰,۰۰۰ تومان | 4050.0 | CAMERA | 2022-12-21 14:41:25.821502 | False | 1 | False | محله هنرستان | مشهد | agency |
| 23996 | wYLiyCOY | آپارتمان 100متری صفرتاصد بازسازی نزدیک اصلی | False | ۱,۳۹۰,۰۰۰,۰۰۰ تومان | 1390.0 | CAMERA | 2022-12-21 14:41:25.821502 | False | 1 | False | باهنر | اهواز | agency |
| 23997 | wYjixctJ | آپارتمان ۱۶۰ متر مهرشهر | False | ۲,۸۰۰,۰۰۰,۰۰۰ تومان | 2800.0 | NaN | 2022-12-21 14:41:25.821502 | False | 0 | False | کیان آباد | اهواز | personal |
| 23998 | wYjqhwMA | آپارتمان ۱30 متری الهیه | False | ۳,۰۰۰,۰۰۰,۰۰۰ تومان | 3000.0 | NaN | 2022-12-21 14:41:25.821502 | False | 0 | True | الهیه | مشهد | personal |
| 23999 | wYjqhkBe | خرید واحدخانه خونه ۱۶۱متر سه خوابه حیاط بالکن ... | False | ۱۲,۸۰۰,۰۰۰,۰۰۰ تومان | 12800.0 | CAMERA | 2022-12-21 14:41:25.821502 | False | 5 | False | سازمان آب | تهران | consultant |
24000 rows × 13 columns
Let us review the cleaned columns:
- token: Post token; can be used in the URL https://divar.ir/v/<token>
- title: Post title
- promoted: Whether the post is promoted (فوری)
- price_persian: Price in Persian digits
- price_million: Price in million Tomans
- media: Type of media the post has (nan, 'CAMERA', 'VIDEOCAM', 'VR')
- last_post_sort_date: Date the post was retrieved
- inspection: Whether the listing is inspected by Divar (صحت اطلاعات)
- image_count: Number of images in the post
- has_chat: Whether the post has text chat enabled
- district_persian: District of the post within the city
- city_persian: Name of the post's city
- business_type: Business type of the post's owner (agency, consultant, personal)
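As a sanity check of the cleaned schema, the invariants above can be asserted on a hypothetical row (values made up to mirror the table):

```python
import pandas as pd

# One made-up cleaned row, mirroring the column list above
row = pd.DataFrame([{
    "token": "wYaKSWxx",
    "title": "example title",
    "promoted": False,
    "price_million": 2850.0,
    "media": "CAMERA",
    "last_post_sort_date": pd.Timestamp("2022-12-22 13:17:47"),
    "inspection": False,
    "image_count": 0,
    "has_chat": True,
    "district_persian": "کلاهدوز",
    "city_persian": "مشهد",
    "business_type": "agency",
}])

assert row["promoted"].dtype == bool and row["inspection"].dtype == bool
assert row["business_type"].isin(["agency", "consultant", "personal"]).all()
assert (row["price_million"] >= 0).all()
print(row["price_million"].iloc[0] / 1000, "billion Tomans")  # 2.85 billion Tomans
```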
def to_fa(labels: list):
    # Reshape and reorder Persian text so matplotlib renders it correctly
    return [get_display(reshape(label)) for label in labels]

vc = df["city_persian"].value_counts()
plt.figure(figsize=(8, 5))
plt.bar(to_fa(vc.keys()), vc.values)
plt.title("Number of Posts in Each City")
plt.xlabel("City"); plt.ylabel("Number of Posts")
plt.show()
The number of posts in Tehran is significantly higher than in other cities. Additionally, the order of cities matches their population order in the 1395 (2016) census, except for Karaj, which ranks second in posts but fourth in population. (Divar should investigate whether Karaj posts are genuine, or whether some posters list Tehran properties in Karaj as well to get more views.)
vc = df["business_type"].value_counts()
vc
plt.figure(figsize=(5, 3))
plt.bar(to_fa(vc.keys()), vc.values)
plt.title("Number of Posts by Each Business Type")
plt.xlabel("Business Type"); plt.ylabel("Number of Posts")
plt.show()
Agencies account for most of the posts in Divar's real estate category; personal posts are drastically fewer, and consultants are similar in number to agencies.
However, the city data is imbalanced and Tehran can dominate the distribution. Therefore, we plot this figure for each city below.
fig = px.histogram(df, y="city_persian", color='business_type',
barmode='group', width=1000, height=400,
title="Business Type per City").update_yaxes(categoryorder='total ascending')
fig.show(renderer="png")
stack100 = (df.groupby(['city_persian', 'business_type']).size() / df.groupby(['city_persian']).size()).unstack().sort_values("agency")
stack100.index = to_fa(stack100.index)
stack100.plot(kind="barh", stacked=True, figsize=(10, 5))
for col in stack100:
    for i, (cs, ab) in enumerate(zip(stack100.cumsum(axis=1)[col], stack100[col])):
        plt.text(cs - ab / 2, i, f"{np.round(ab * 100, 1)}%", va='center', ha='center')
plt.title("Business Type Ratio per City")
plt.xlabel("Business Type Ratio"); plt.ylabel("City")
plt.show()
Interestingly, Tehran stands out with the largest agency ratio: only 7.6% of posts are made by individuals, while about 92.4% come from agencies or consultants. On the other hand, Shiraz has the highest ratio of personal posts and the lowest ratio of agencies. This suggests more agencies are active in Tehran. Having an agency account might be expensive, or carry specific benefits that Tehran agencies have chosen to pay for. For the personal posts, we do not know whether they are truly personal or whether an agency created a personal account. Either way, Divar should work on making agency accounts more appealing to those who work in other cities.
price_df = df.copy()
price_df["price"] = price_df['price_million'] / 1000
fig = px.box(price_df, x="price", width=1200, height=300, title="Price Boxplot (Billion Tomans)")
fig.show(renderer="png")
df[df["price_million"] > df["price_million"].quantile(0.999)].sample(5) # There are some mistakes in these
| token | title | promoted | price_persian | price_million | media | last_post_sort_date | inspection | image_count | has_chat | district_persian | city_persian | business_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7864 | wYkyxoh7 | گلستان شمالی ۵۵۰ متر مدرن با تراس | False | ۱۸۰,۰۰۰,۰۰۰,۰۰۰ تومان | 180000.0 | CAMERA | 2022-12-22 09:21:49.195784 | False | 5 | False | اقدسیه | تهران | agency |
| 21859 | wYjSTxtC | ۱۰۵۰ متر مستغلات/نیاوران/نوساز تکواحدی/فروش/تهاتر | False | ۱۹۰,۰۰۰,۰۰۰,۰۰۰ تومان | 190000.0 | CAMERA | 2022-12-21 17:24:44.986700 | False | 9 | False | نیاوران | تهران | agency |
| 6667 | wYrlVXzl | ۱۰۰۰متر پنت هاوس واقع در الهیه | False | ۲۸۰,۰۰۰,۰۰۰,۰۰۰ تومان | 280000.0 | CAMERA | 2022-12-22 09:50:08.056687 | False | 8 | False | الهیه | تهران | consultant |
| 2826 | wYpdCCBr | دروس ۱۵۰۰ متر فروش و معاوضه | False | ۳۰۰,۰۰۰,۰۰۰,۰۰۰ تومان | 300000.0 | CAMERA | 2022-12-22 11:54:42.652432 | False | 6 | False | دروس | تهران | consultant |
| 16975 | wYhqXClh | 3450متر بنای یکجا غرب اسپیناس | False | ۱۸۰,۰۰۰,۰۰۰,۰۰۰ تومان | 180000.0 | NaN | 2022-12-22 06:55:14.511584 | False | 0 | False | سعادتآباد | تهران | consultant |
df[df["price_million"] < df["price_million"].quantile(0.0001)] # These have no problem
| token | title | promoted | price_persian | price_million | media | last_post_sort_date | inspection | image_count | has_chat | district_persian | city_persian | business_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4145 | wYhqXNjk | ۱۵۰مترآپارتمان رزکان نو واریان شهر تک واحدی | False | ۳۵,۱۰۰,۰۰۰ تومان | 35.1 | CAMERA | 2022-12-22 11:09:06.172732 | False | 1 | False | شهرک بنفشه | کرج | agency |
| 19946 | wYZedbb7 | امتیاز آپارتمان دو خوابه بدون واریزی صدرا | False | ۲۰,۰۰۰,۰۰۰ تومان | 20.0 | NaN | 2022-12-22 01:22:35.592323 | False | 0 | True | شهر صدرا | شیراز | personal |
| 20341 | wYjK3lLQ | ۱۷۰مترتجاری | False | ۴۰,۰۰۰,۰۰۰ تومان | 40.0 | CAMERA | 2022-12-21 21:03:03.115193 | False | 3 | True | کوی مهدیس | اهواز | personal |
Some of these outliers mistakenly put the total price in the per-meter price field. That gives 4B Tomans per meter, so a 161 m² unit would cost 666B Tomans. (This happens a lot, and Divar should fix it.)
Another post was selling 200 units at once as 2000 m², which would cost 990B Tomans.
And some are genuinely luxurious.
For the following analyses, we hide outliers to better visualize the data.
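The boxplots below simply hide fliers (showfliers=False); an explicit filter would look like this minimal sketch, applying the standard 1.5×IQR rule to hypothetical prices:

```python
import pandas as pd

def trim_outliers(s: pd.Series) -> pd.Series:
    # Keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the same rule
    # boxplots use to decide which points are drawn as fliers
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

# Made-up prices in million Tomans; the last one is a per-meter mistake
prices = pd.Series([2850, 13200, 5130, 2900, 12600, 666000])
print(trim_outliers(prices).max())  # 13200
```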
price_df = df.copy().sort_values("price_million", ascending=False)
price_df["price_billion"] = price_df['price_million'] / 1000
# fig = px.box(price_df, x="city_persian", y="price")
# fig.update_traces(boxpoints=False)
# fig.show(renderer="png")
plt.figure(figsize=(9, 6))
order = price_df.groupby("city_persian")["price_million"].median().sort_values(ascending=False).index
ax = sns.boxplot(x="city_persian", y="price_billion", data=price_df, showfliers=False, order=order)
ax.set_xticklabels(to_fa([t.get_text() for t in ax.get_xticklabels()]))
plt.title("Price Difference in Cities (Billion Tomans)")
plt.show()
Tehran has the highest prices overall, mostly ranging from 4B to 12B Tomans, with a longer tail up to 24B Tomans. Next is Shiraz, although prices there are significantly lower than in Tehran. By contrast, Tabriz appears to have the cheapest median while its upper whisker is higher than those of some other cities: Tabriz generally has low prices, but it also has some expensive apartments. The upper whiskers of Qom and Ahvaz are the lowest of all, showing there are no very expensive apartments in those cities.
It is also worth mentioning that Tehran's box is the largest and Ahvaz's the smallest, which means prices in Tehran are more diverse, while most prices in Ahvaz are close together.
plt.figure(figsize=(16, 8))
vc = df[df["city_persian"] == "تهران"]["district_persian"].value_counts()[:20][::-1]
plt.subplot(1, 2, 1)
plt.barh(to_fa(vc.keys()), vc.values)
plt.title("Most Number of Posts in Tehran Districts")
plt.xlabel("District in Tehran"); plt.ylabel("Number of Posts")
vc = df[df["city_persian"] == "تهران"]["district_persian"].value_counts()[-20:][::-1]
plt.subplot(1, 2, 2)
plt.barh(to_fa(vc.keys()), vc.values)
plt.title("Least Number of Posts in Tehran Districts")
plt.xlabel("District in Tehran"); plt.ylabel("Number of Posts")
plt.tight_layout()
plt.show()
Most posts are in Sadat Abad and Punak, both in District 2. There are also many posts in the west of Tehran, such as around Chitgar Lake and Jannat Abad. This shows that the north and west of Tehran are the most active in real estate.
Unfortunately, the shapefiles for Tehran's districts are not free, so we cannot leverage geopandas to visualize our data on a map.
Instead, we use https://www.openstreetmap.org/ and https://overpass-turbo.eu/ to show the neighborhoods with the most posts.
(
  rel(8321827);     // Sadat Abad
  rel(8321816);     // Punak
  rel(6728897);     // District 3
  rel(6728896);     // District 2
  rel(8321815);     // Jannat Abad
  rel(8321694);     // Marzdaran
  rel(8321739);     // Jeyhoon
  rel(4534491726);  // Chitgar
  rel(6729506);     // District 22
  rel(5656701493);  // Niavaran
  rel(8321846);     // Gheytarieh
  rel(8321856);     // Zaferanieh
  rel(8321681);     // Yousef Abad
);
out geom;
price_df = df[df["city_persian"] == "تهران"].copy().sort_values("price_million", ascending=False)
price_df["price_billion"] = price_df['price_million'] / 1000
# Filter least frequent districts
vc = price_df["district_persian"].value_counts()
price_df = price_df[price_df["district_persian"].isin(vc[vc > 150].index)]
plt.figure(figsize=(9, 11))
order = price_df.groupby("district_persian")["price_million"].median().sort_values(ascending=False).index
ax = sns.boxplot(y="district_persian", x="price_billion", data=price_df, showfliers=False, order=order)
ax.set_yticklabels(to_fa([t.get_text() for t in ax.get_yticklabels()]))
plt.title("Price Difference in Districts of Tehran (Billion Tomans)")
plt.show()
To make the plot clearer and more reliable, we include only districts with more than 150 posts. Among these, Zaferanieh, Elahieh, Farmanieh, Velenjak, and similar districts have the highest overall prices, while Andisheh, Chitgar, and Salsabil have the lowest. This plot best shows how expensive it is to buy an apartment in each Tehran neighborhood.
price_df = df[df["city_persian"] == "تهران"].copy().sort_values("price_million", ascending=False)
price_df["price_billion"] = price_df['price_million'] / 1000
order = price_df.groupby("business_type")["price_million"].median().sort_values(ascending=False).index
plt.figure(figsize=(6, 4))
ax = sns.boxplot(x="business_type", y="price_billion", data=price_df, showfliers=False, order=order)
plt.title("Price Difference \nbetween Business Types in Tehran (Billion Tomans)")
plt.show()
First of all, we limit this comparison to Tehran because its prices differ greatly from other cities' and because agencies are over-represented there, which could bias the results.
We see that agencies list the highest prices, then consultants, with personal posters setting the cheapest prices. The easy conclusion is that agencies set higher prices to leave room for bargaining, while personal accounts price more fairly. Alternatively, agencies may be pricing fairly and simply selling more expensive houses in general than individuals do.
We need to further examine the distribution of agencies in each district.
price_df = df[df["city_persian"] == "تهران"].copy().sort_values("price_million", ascending=False)
price_df["price_billion"] = price_df['price_million'] / 1000
# Filter least frequent districts
vc = price_df["district_persian"].value_counts()
price_df = price_df[price_df["district_persian"].isin(vc[vc > 150].index)]
order = price_df.groupby("district_persian")["price_million"].median().sort_values(ascending=True).index
stack100 = (price_df.groupby(['district_persian', 'business_type']).size() / price_df.groupby(['district_persian']).size()).unstack().sort_values("agency")
stack100 = stack100.reindex(order)
stack100.index = to_fa(stack100.index)
stack100 = stack100.fillna(0)
stack100.plot(kind="barh", stacked=True, figsize=(13, 10))
for col in stack100:
    for i, (cs, ab) in enumerate(zip(stack100.cumsum(axis=1)[col], stack100[col])):
        plt.text(cs - ab / 2, i, f"{np.round(ab * 100, 1)}%", va='center', ha='center')
plt.title("Business Type Ratio per District \n(Districts are sorted by their median price)")
plt.xlabel("Business Type Ratio"); plt.ylabel("District")
plt.show()
Top rows are the most expensive districts. We see that personal posts are slightly more common in cheaper districts. This might be because real estate agencies earn higher commissions on more expensive apartments, so they operate somewhat more in expensive neighborhoods.
dfc = df.copy()
dfc["has_chat_"] = "Chat Disabled"
dfc.loc[dfc["has_chat"] == True, "has_chat_"] = "Chat Enabled"
fig = px.icicle(
dfc,
path=[px.Constant("All Posts"), "business_type", "has_chat_"],
color_discrete_sequence=px.colors.qualitative.Dark2,
width=1400, height=500,
title = "Icicle Chart of Business Types and Their Chat Enabled Status"
)
fig.update_traces(
root_color="lightblue",
tiling = dict(orientation='v'),
textinfo="percent entry+label+value+text",
# marker = {"colors": ["deepskyblue", "lightsalmon", "tan", "teal", "silver"]},
)
fig.show(renderer="png")
Another interesting result is that agencies tend to disable Divar's text chat (only 38% enabled), whereas personal accounts mostly leave it enabled (76%). Agencies likely prefer to talk to customers directly: a phone call conveys more information faster and lets them pitch additional listings. Personal sellers, on the other hand, may find calls disruptive and prefer to reply by text at their convenience, so they enable chat.
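The chat-enabled shares per business type can be computed with a row-normalized crosstab. A minimal sketch on toy data (the column names `business_type` and `has_chat` come from the dataset; the values are made up):

```python
import pandas as pd

# Toy stand-in for the scraped df
toy = pd.DataFrame({
    "business_type": ["agency"] * 4 + ["personal"] * 4,
    "has_chat": [True, False, False, False, True, True, True, False],
})

# Share of chat-enabled posts within each business type
# (normalize="index" makes each row sum to 1)
chat_share = pd.crosstab(toy["business_type"], toy["has_chat"], normalize="index")
print(chat_share)
```

Running the same crosstab on the full dataset yields the 38% vs. 76% figures quoted above.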
dfc = df.copy()
dfc["promoted_"] = "Not Promoted"
dfc.loc[dfc["promoted"] == True, "promoted_"] = "Promoted"
fig = px.icicle(
dfc,
path=[px.Constant("All Posts"), "promoted_"],
color_discrete_sequence=px.colors.qualitative.Dark2,
width=1300, height=300,
title = "Icicle Chart of Promotions"
)
fig.update_traces(
root_color="lightblue",
tiling = dict(orientation='v'),
textinfo="percent entry+label+value+text",
# marker = {"colors": ["deepskyblue", "lightsalmon", "tan", "teal", "silver"]},
)
fig.show(renderer="png")
dfc = dfc[dfc["promoted"] == True]
fig = px.icicle(
dfc,
path=[px.Constant("Promoted Posts"), "business_type"],
color_discrete_sequence=px.colors.qualitative.Dark2,
width=1300, height=400,
title = "Icicle Chart of Business Type of Promoted Posts"
)
fig.update_traces(
root_color="lightblue",
tiling = dict(orientation='v'),
textinfo="percent entry+label+value+text",
# marker = {"colors": ["deepskyblue", "lightsalmon", "tan", "teal", "silver"]},
)
fig.show(renderer="png")
First of all, only 3% of the posts were paid promotions (فوری). Second, personal accounts promoted their posts far more than agencies and consultants (85% vs. 15%).
In Q2 we observed that most posts come from agencies and only a few from personal accounts, which makes this promotion pattern even more striking: personal accounts pay for promotion significantly more often. This suggests, first, that Divar earns promotion revenue mainly from its small base of personal users rather than its many agencies (agencies probably pay other fees, but not for promotions). Moreover, agencies have probably found other ways to push their posts to the top of the list, such as posting the same apartment multiple times with slightly different details, so they do not need Divar's paid visibility options.
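The 85/15 split among promoted posts is just a normalized value count over the promoted subset. A minimal sketch on toy data (the boolean `promoted` and `business_type` columns come from the dataset; the values are made up):

```python
import pandas as pd

# Toy stand-in for the scraped df
toy = pd.DataFrame({
    "business_type": ["agency"] * 6 + ["personal"] * 4,
    "promoted": [True, False, False, False, False, False, True, True, True, False],
})

# Restrict to promoted posts, then compute each type's share of them
promoted = toy[toy["promoted"]]
share = promoted["business_type"].value_counts(normalize=True)
print(share)
```

On the real data, "personal" dominates this breakdown despite being a minority of all posts.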
dfc = df.copy()
dfc = dfc.fillna("No Media")
fig = px.histogram(dfc, x="media", height=300, title="Media Types of Posts", text_auto=True)
fig.show(renderer="png")
The most common media type attached to posts is images (camera). Many posts have no media at all, and only a few include videos or a VR tour of the apartment.
corr = df.corr(numeric_only=True)
sns.heatmap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values, annot=True, cmap='coolwarm')
plt.title("Pairwise Correlations")
plt.tight_layout()
plt.show()
There are no significant correlations between the numeric features. Intuitively, predicting price requires the city+district and the size of the house. This correlation matrix includes neither districts (we analysed their effect in Q4 & Q5) nor floor area, which was not available in the listing view (and we could not request all 24,000 posts to fetch the full features).
Nonetheless, I look forward to training models to see whether they can predict the features of this dataset from what is available, and which cities might be easier to predict.
The largest correlation in this map is 0.1, between promoted and has_chat. This might itself be driven by business type: we observed that personal accounts both enable their chats and buy most of the promotions, which could produce this slightly positive correlation.
In this section we found Divar's API, requested the search pages, and saved the UI data.
In the cleaning phase, we dropped unused columns, extracted clean values out of the UI strings, and fixed data types.
In the EDA, we performed a comprehensive analysis of the available features and answered 11 questions based on the data. Some of the interesting points were: